34 research outputs found

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. a semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis and semantic role labelling. These tools identify or tag only part of the core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenges faced in this Ph.D. work. Since standard NLP pipeline tools are not widely available for Urdu, a suite of new tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these tools are as follows: the word tokenizer achieves an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. 
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entities. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06% and Accuracy of 0.94%. The best lexical coverage figures of 88.59%, 99.63%, 96.71% and 89.63% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.

    Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus, which is free and publicly available to download.

    Lexical coverage evaluation of large-scale multilingual semantic lexicons for twelve languages

    The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks of existing lexical knowledge resources to cover more languages, such as EuroWordNet and Global WordNet. In this paper, we report on the construction of large-scale multilingual semantic lexicons for twelve languages, which employ the unified Lancaster semantic taxonomy and provide a multilingual lexical knowledge base for the automatic UCREL semantic annotation system (USAS). Our work contributes towards the goal of constructing larger-scale and higher-quality multilingual semantic lexical resources and developing corpus annotation tools based on them. Lexical coverage is an important factor concerning the quality of the lexicons and the performance of the corpus annotation tools, and in this experiment we focus on evaluating the lexical coverage achieved by the multilingual lexicons and the semantic annotation tools based on them. Our evaluation shows that some semantic lexicons, such as those for Finnish and Italian, have achieved lexical coverage of over 90%, while others need further expansion.

    Epidemiology, practice of ventilation and outcome for patients at increased risk of postoperative pulmonary complications

    BACKGROUND Limited information exists about the epidemiology and outcome of surgical patients at increased risk of postoperative pulmonary complications (PPCs), and how intraoperative ventilation was managed in these patients. OBJECTIVES To determine the incidence of surgical patients at increased risk of PPCs, and to compare the intraoperative ventilation management and postoperative outcomes with patients at low risk of PPCs. DESIGN This was a prospective international 1-week observational study using the ‘Assess Respiratory Risk in Surgical Patients in Catalonia risk score’ (ARISCAT score) for PPC for risk stratification. PATIENTS AND SETTING Adult patients requiring intraoperative ventilation during general anaesthesia for surgery in 146 hospitals across 29 countries. MAIN OUTCOME MEASURES The primary outcome was the incidence of patients at increased risk of PPCs based on the ARISCAT score. Secondary outcomes included intraoperative ventilatory management and clinical outcomes. RESULTS A total of 9864 patients fulfilled the inclusion criteria. The incidence of patients at increased risk was 28.4%. The most frequently chosen tidal volume (VT) size was 500 ml, or 7 to 9 ml kg−1 predicted body weight, slightly lower in patients at increased risk of PPCs. Levels of positive end-expiratory pressure (PEEP) were slightly higher in patients at increased risk of PPCs, with 14.3% receiving more than 5 cmH2O PEEP compared with 7.6% in patients at low risk of PPCs (P < 0.001). Patients with a predicted preoperative increased risk of PPCs developed PPCs more frequently (19 versus 7%, relative risk (RR) 3.16 (95% confidence interval 2.76 to 3.61), P < 0.001) and had longer hospital stays. The only ventilatory factor associated with the occurrence of PPCs was the peak pressure. CONCLUSION The incidence of patients with a predicted increased risk of PPCs is high. A large proportion of patients receive high VT and low PEEP levels. 
PPCs occur frequently in patients at increased risk, with worse clinical outcome.

    Prognostic model to predict postoperative acute kidney injury in patients undergoing major gastrointestinal surgery based on a national prospective observational cohort study.

    Background: Acute illness, existing co-morbidities and surgical stress response can all contribute to postoperative acute kidney injury (AKI) in patients undergoing major gastrointestinal surgery. The aim of this study was to develop prospectively a pragmatic prognostic model to stratify patients according to risk of developing AKI after major gastrointestinal surgery. Methods: This prospective multicentre cohort study included consecutive adults undergoing elective or emergency gastrointestinal resection, liver resection or stoma reversal in 2-week blocks over a continuous 3-month period. The primary outcome was the rate of AKI within 7 days of surgery. Bootstrap stability was used to select clinically plausible risk factors into the model. Internal model validation was carried out by bootstrap validation. Results: A total of 4544 patients were included across 173 centres in the UK and Ireland. The overall rate of AKI was 14·2 per cent (646 of 4544) and the 30-day mortality rate was 1·8 per cent (84 of 4544). Stage 1 AKI was significantly associated with 30-day mortality (unadjusted odds ratio 7·61, 95 per cent c.i. 4·49 to 12·90; P < 0·001), with increasing odds of death with each AKI stage. Six variables were selected for inclusion in the prognostic model: age, sex, ASA grade, preoperative estimated glomerular filtration rate, planned open surgery and preoperative use of either an angiotensin-converting enzyme inhibitor or an angiotensin receptor blocker. Internal validation demonstrated good model discrimination (c-statistic 0·65). Discussion: Following major gastrointestinal surgery, AKI occurred in one in seven patients. This preoperative prognostic model identified patients at high risk of postoperative AKI. Validation in an independent data set is required to ensure generalizability.

    Epidemiology, practice of ventilation and outcome for patients at increased risk of postoperative pulmonary complications: LAS VEGAS - An observational study in 29 countries

    BACKGROUND Limited information exists about the epidemiology and outcome of surgical patients at increased risk of postoperative pulmonary complications (PPCs), and how intraoperative ventilation was managed in these patients. OBJECTIVES To determine the incidence of surgical patients at increased risk of PPCs, and to compare the intraoperative ventilation management and postoperative outcomes with patients at low risk of PPCs. DESIGN This was a prospective international 1-week observational study using the ‘Assess Respiratory Risk in Surgical Patients in Catalonia risk score’ (ARISCAT score) for PPC for risk stratification. PATIENTS AND SETTING Adult patients requiring intraoperative ventilation during general anaesthesia for surgery in 146 hospitals across 29 countries. MAIN OUTCOME MEASURES The primary outcome was the incidence of patients at increased risk of PPCs based on the ARISCAT score. Secondary outcomes included intraoperative ventilatory management and clinical outcomes. RESULTS A total of 9864 patients fulfilled the inclusion criteria. The incidence of patients at increased risk was 28.4%. The most frequently chosen tidal volume (VT) size was 500 ml, or 7 to 9 ml kg−1 predicted body weight, slightly lower in patients at increased risk of PPCs. Levels of positive end-expiratory pressure (PEEP) were slightly higher in patients at increased risk of PPCs, with 14.3% receiving more than 5 cmH2O PEEP compared with 7.6% in patients at low risk of PPCs (P < 0.001). Patients with a predicted preoperative increased risk of PPCs developed PPCs more frequently (19 versus 7%, relative risk (RR) 3.16 (95% confidence interval 2.76 to 3.61), P < 0.001) and had longer hospital stays. The only ventilatory factor associated with the occurrence of PPCs was the peak pressure. CONCLUSION The incidence of patients with a predicted increased risk of PPCs is high. A large proportion of patients receive high VT and low PEEP levels. 
PPCs occur frequently in patients at increased risk, with worse clinical outcome.